Language identification with limited resources

Authors

  • Emilio Sanchis Arnal
  • Mayte Giménez
  • Lluís F. Hurtado
Abstract

Language identification is an important problem in many speech applications. We treat it as the classification of phoneme sequences, on the assumption that each language has its own phonotactic characteristics. To perform this classification, the speech utterances must first be decoded into phonemes. The phoneme set must be the same for all languages, so that the acoustic sequences have a comparable representation. We followed two approaches that share the same acoustic model: in one, the audio is decoded with a language model of phoneme trigrams; in the other, with a language model of equiprobable phoneme unigrams. A classification step based on perplexity is then performed.
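
The perplexity-based classification step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one add-one-smoothed phoneme trigram model per language, all sharing a single phoneme set, and it assigns an utterance to the language whose model gives the lowest perplexity. The function names, the smoothing choice, and the toy phoneme data are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"

def train_trigram_model(sequences, phoneme_set):
    """Count phoneme trigrams and their two-phoneme contexts for one language."""
    trigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = [START, START] + list(seq) + [END]
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i - 2:i + 1])] += 1
            contexts[tuple(padded[i - 2:i])] += 1
    # Predictable symbols: every shared phoneme plus the end marker.
    vocab_size = len(phoneme_set) + 1
    return trigrams, contexts, vocab_size

def perplexity(model, seq):
    """Perplexity of a phoneme sequence under an add-one-smoothed trigram model."""
    trigrams, contexts, vocab_size = model
    padded = [START, START] + list(seq) + [END]
    log_prob = 0.0
    for i in range(2, len(padded)):
        tri = tuple(padded[i - 2:i + 1])
        ctx = tuple(padded[i - 2:i])
        # Add-one smoothing is an assumption made for this sketch.
        prob = (trigrams[tri] + 1) / (contexts[ctx] + vocab_size)
        log_prob += math.log(prob)
    n_predicted = len(padded) - 2
    return math.exp(-log_prob / n_predicted)

def identify_language(models, seq):
    """Return the language whose model yields the lowest perplexity."""
    return min(models, key=lambda lang: perplexity(models[lang], seq))

# Toy, invented phoneme strings (not real data), using one shared phoneme set.
PHONEMES = {"k", "a", "s", "m", "t", "r", "p"}
models = {
    "A": train_trigram_model([["k", "a", "s", "a"], ["m", "a", "s", "a"],
                              ["k", "a", "m", "a"]], PHONEMES),
    "B": train_trigram_model([["s", "t", "r", "a"], ["s", "p", "r", "a"]],
                             PHONEMES),
}
print(identify_language(models, ["k", "a", "s", "a"]))
```

In this sketch the decoded phoneme sequence is scored by every language model in turn, which only works because all models are defined over the same phoneme set, as the abstract requires.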

Similar resources

Patient Safety and Healthcare Quality: The Case for Language Access

This paper describes the need for Culturally and Linguistically Appropriate Services (CLAS) for Limited English Proficient (LEP) patients, identifies how the lack of CLAS for LEP patients can compromise patient safety and healthcare quality, and discusses barriers to the provision of CLAS.

Phonotactic spoken language identification with limited training data

We investigate the addition of a new language, for which limited resources are available, to a phonotactic language identification system. Two classes of approaches are studied: in the first class, only existing phonetic recognizers are employed, whereas an additional phonetic recognizer in the new language is created for the second class. It is found that the number of acoustic recognizers emp...

Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks

Long Short Term Memory (LSTM) Recurrent Neural Networks (RNNs) have recently outperformed other state-of-the-art approaches, such as i-vector and Deep Neural Networks (DNNs), in automatic Language Identification (LID), particularly when dealing with very short utterances (∼3s). In this contribution we present an open-source, end-to-end, LSTM RNN system running on limited computational resources...

Developing Language-tagged Corpora for Code-switching Tweets

Code-switching, where a speaker switches between languages mid-utterance, is frequently used by multilingual populations worldwide. Despite its prevalence, limited effort has been devoted to developing computational approaches or even basic linguistic resources to support research into the processing of such mixed-language data. We present a user-centric approach to collecting code-switched utteran...

Using Prolog for Biological Descriptions

We describe a system which performs biological identification on the basis of natural language descriptions. The system parses texts containing large sets of biological descriptions in restricted natural language and constructs a knowledge base. The system can semi-automatically adapt to a text by extending its lexicon and, to a limited extent, its grammar. Prolog features are important in both...

Publication date: 2014